AITopics | quantization threshold


MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models

Liu, Wenyuan, Meng, Haoqian, Luo, Yilun, Zhang, Peng, Ma, Xindian

arXiv.org Artificial Intelligence, Aug-5-2025

Quantization significantly accelerates inference in large language models (LLMs) by replacing original high-precision matrices with low-precision counterparts. Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format. Although the new FP4 Tensor Cores in NVIDIA's Blackwell architecture offer up to 4x speedup over FP16, existing INT4-based kernels fail to fully exploit this capability due to mismatched data formats. To bridge this gap, we propose MicroMix, a co-designed mixed-precision quantization algorithm and matrix multiplication kernel based on Microscaling (MX) data formats. Tailored for the Blackwell architecture, the MicroMix kernel supports arbitrary combinations of MXFP4, MXFP6, and MXFP8 channels, and produces BFloat16 outputs. To achieve a favorable trade-off between accuracy and efficiency for each linear layer, we introduce quantization thresholds that identify activation elements where lower-precision formats (MXFP4 or MXFP6) incur excessive quantization error. Our algorithm selectively allocates higher-precision channels to preserve accuracy while maintaining compute efficiency. MicroMix achieves competitive or superior performance across diverse downstream tasks, including zero-shot and few-shot learning, language modeling, code generation, and mathematical reasoning. On both consumer-grade (RTX 5070Ti laptop) and server-grade (RTX 5090) GPUs, our kernel delivers at least 20% faster execution than TensorRT-FP8. Furthermore, when applied to various Llama and Qwen models, MicroMix consistently improves prefill latency and memory efficiency across a range of batch sizes compared to TensorRT baselines. Our code is available at https://github.com/lwy2020/MicroMix.
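As a rough illustration of threshold-based precision allocation, the sketch below gives each activation channel the lowest-precision MX format whose threshold its largest absolute value does not exceed. The function name `assign_channel_formats`, the per-channel absolute maximum as the selection statistic, and the threshold values are all assumptions made for this example; the abstract does not say how MicroMix derives or applies its thresholds.

```python
def assign_channel_formats(channel_abs_max, t_fp4, t_fp6):
    """Pick an MX format per channel from two hypothetical thresholds.

    channel_abs_max: largest absolute activation seen in each channel.
    t_fp4: channels at or below this value are kept in MXFP4.
    t_fp6: channels above t_fp4 but at or below this value use MXFP6.
    Anything larger falls back to MXFP8.
    """
    if t_fp4 > t_fp6:
        raise ValueError("expected t_fp4 <= t_fp6")
    formats = []
    for amax in channel_abs_max:
        if amax <= t_fp4:
            formats.append("MXFP4")
        elif amax <= t_fp6:
            formats.append("MXFP6")
        else:
            formats.append("MXFP8")
    return formats


# Hypothetical per-channel statistics and thresholds, for illustration only.
print(assign_channel_formats([0.5, 3.0, 20.0], t_fp4=1.0, t_fp6=8.0))
```

Raising either threshold moves more channels to the cheaper format, which is the accuracy-versus-efficiency dial the abstract describes.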

artificial intelligence, large language model, natural language, (18 more...)

arXiv.org Artificial Intelligence

2508.02343

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)


Oscillation-Reduced MXFP4 Training for Vision Transformers

Chen, Yuxiang, Xi, Haocheng, Zhu, Jun, Chen, Jianfei

arXiv.org Artificial Intelligence, Feb-28-2025

Pre-training Transformers in FP4 precision is becoming a promising approach to gain substantial speedup, but it comes with a considerable loss of accuracy. The Microscaling (MX) data format provides a fine-grained per-group quantization method to improve the representation ability of the FP4 format and is supported by the next-generation Blackwell GPU architecture. However, training with the MXFP4 data format still results in significant degradation, and there is a lack of systematic research on the reason. In this work, we propose a novel training method, TetraJet, for more accurate FP4 training. We comprehensively evaluate all of the quantizers involved in the training, and identify the weight oscillation problem in the forward pass as the main source of the degradation in MXFP4 training. Therefore, we introduce two novel methods, EMA Quantizer (Q-EMA) and Adaptive Ramping Optimizer (Q-Ramping), to resolve the oscillation problem. Extensive experiments on Vision Transformers demonstrate that TetraJet consistently outperforms the existing 4-bit training methods, and Q-EMA & Q-Ramping can provide additional enhancement by effectively reducing oscillation. We decreased the accuracy degradation by more than 50% compared to the baseline, and can even achieve competitive performance compared to full precision training. The code is available at https://github.com/thu-ml/TetraJet-MXFP4Training
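To make the per-group idea concrete, here is a minimal sketch of quantizing one block to an MXFP4-style representation: a shared power-of-two scale plus FP4 (E2M1) element values. It assumes the usual E2M1 magnitude grid and a scale derived from the block's largest magnitude. The helper name, the tie-breaking rule, and the omission of the fixed block size and scale-range clamping are simplifications for illustration; this is not the TetraJet quantizer.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # exponent of the largest magnitude: 6 = 1.5 * 2**2


def quantize_block_mxfp4(block):
    """Return (shared_scale, dequantized_values) for one block of floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared power-of-two scale chosen so the block maximum lands near
    # the top of the E2M1 range.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    out = []
    for v in block:
        scaled = abs(v) / scale
        # Nearest grid point; ties go to the smaller magnitude here.
        # Values above 6 saturate to 6.
        mag = min(FP4_E2M1_VALUES, key=lambda g: abs(scaled - g))
        out.append(math.copysign(mag, v) * scale)
    return scale, out


scale, values = quantize_block_mxfp4([0.1, -0.4, 1.0, 7.0])
print(scale, values)
```

Because every element in the block shares one scale, a single large value (7.0 above) coarsens the grid for the small ones, which is why finer groups help the FP4 format.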

oscillation, oscillation-reduced mxfp4 training, q-ramping, (13 more...)

arXiv.org Artificial Intelligence

2502.20853

Country: North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report > Promising Solution (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)


Oscillations Make Neural Networks Robust to Quantization

Wenshøj, Jonathan, Pepin, Bob, Selvan, Raghavendra

arXiv.org Artificial Intelligence, Feb-1-2025

We challenge the prevailing view that oscillations in Quantization Aware Training (QAT) are merely undesirable artifacts caused by the Straight-Through Estimator (STE). Through theoretical analysis of QAT in linear models, we demonstrate that the gradient of the loss function can be decomposed into two terms: the original full-precision loss and a term that causes quantization oscillations. Based on these insights, we propose a novel regularization method that induces oscillations to improve quantization robustness. Contrary to traditional methods that focus on minimizing the effects of oscillations, our approach leverages the beneficial aspects of weight oscillations to preserve model performance under quantization. Our empirical results on ResNet-18 and Tiny ViT demonstrate that this counter-intuitive strategy matches QAT accuracy at >= 3-bit weight quantization, while maintaining close to full precision accuracy at bits greater than the target bit. Our work therefore provides a new perspective on model preparation for quantization, particularly for finding weights that are robust to changes in the bit of the quantizer -- an area where current methods struggle to match the accuracy of QAT at specific bits.
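The oscillation phenomenon itself is easy to reproduce in a toy setting. The sketch below trains a single latent weight with a round-to-nearest quantizer and a straight-through gradient; the target value, learning rate, and function names are assumptions chosen for illustration, and this is not the regularizer proposed in the paper.

```python
import math


def round_half_up(w):
    """Round to the nearest integer grid point, with ties going up."""
    return math.floor(w + 0.5)


def simulate_ste(w0=0.4, target=0.3, lr=0.5, steps=20):
    """Gradient descent on 0.5 * (q(w) - target)**2 using the STE.

    The straight-through estimator treats dq/dw as 1, so the gradient
    with respect to the latent weight is simply q(w) - target.
    Returns the sequence of quantized weights seen during training.
    """
    w = w0
    history = []
    for _ in range(steps):
        q = round_half_up(w)
        history.append(q)
        grad = q - target
        w -= lr * grad
    return history


def count_flips(history):
    """Number of times the quantized weight changes between steps."""
    return sum(1 for a, b in zip(history, history[1:]) if a != b)


history = simulate_ste()
print(history, count_flips(history))
```

Since the target lies strictly between the grid points 0 and 1, neither quantized value gives a zero gradient, so the latent weight keeps crossing the rounding boundary instead of settling.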

artificial intelligence, machine learning, quantization, (18 more...)

arXiv.org Artificial Intelligence

2502.0049

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Denmark > Capital Region > Copenhagen (0.04)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)


Trained Quantization Thresholds for Accurate and Efficient Fixed-Point Inference of Deep Neural Networks

Jain, Sambhav R., Gural, Albert, Wu, Michael, Dick, Chris H.

arXiv.org Artificial Intelligence, Feb-28-2020

We propose a method of training quantization thresholds (TQT) for uniform symmetric quantizers using standard backpropagation and gradient descent. Contrary to prior work, we show that a careful analysis of the straight-through estimator for threshold gradients allows for a natural range-precision trade-off leading to better optima. Our quantizers are constrained to use power-of-2 scale factors and per-tensor scaling of weights and activations to make them amenable to hardware implementation. We present analytical support for the general robustness of our methods and empirically validate them on various CNNs for ImageNet classification. We are able to achieve near-floating-point accuracy on traditionally difficult networks such as MobileNets with less than 5 epochs of quantized (8-bit) retraining. Finally, we present Graffitist, a framework that enables automatic quantization of TensorFlow graphs for TQT (available at https://github.com/Xilinx/graffitist).
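The forward pass of a uniform symmetric quantizer with a power-of-2 scale factor can be sketched as follows. This is a minimal illustration, assuming a signed integer range and a threshold rounded up to the next power of two. The function name is hypothetical, and the threshold training itself (the gradient through the straight-through estimator) is not shown.

```python
import math


def quantize_symmetric_pow2(x, threshold, bits=8):
    """Quantize then dequantize x with a power-of-2 scale factor.

    threshold: positive clipping threshold (the quantity that would be
    trained); it is rounded up to a power of two so that the scale
    factor is also a power of two.
    """
    pow2_threshold = 2.0 ** math.ceil(math.log2(threshold))
    scale = pow2_threshold / 2 ** (bits - 1)
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # Python's round() resolves exact ties to the nearest even integer.
    q = round(x / scale)
    q = max(q_min, min(q_max, q))  # saturate to the signed integer range
    return q * scale


print(quantize_symmetric_pow2(0.3, threshold=1.0))  # in range
print(quantize_symmetric_pow2(5.0, threshold=0.7))  # clipped at the top
```

A larger threshold widens the representable range but coarsens the step size, which is the range-precision trade-off the abstract refers to.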

gradient, quantization, threshold, (13 more...)

arXiv.org Artificial Intelligence

1903.08066

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Santa Clara County > San Jose (0.04)

Genre: Research Report > New Finding (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.64)


Model-Aware Deep Architectures for One-Bit Compressive Variational Autoencoding

Khobahi, Shahin, Soltanalian, Mojtaba

arXiv.org Machine Learning, Nov-27-2019

Parameterized mathematical models play a central role in the understanding and design of complex information systems. However, they often cannot take into account the intricate interactions innate to such systems. In contrast, purely data-driven approaches do not need explicit mathematical models for data generation and have wider applicability at the cost of interpretability. In this paper, we consider the design of a one-bit compressive variational autoencoder, and propose a novel hybrid model-based and data-driven methodology that allows us not only to design the sensing matrix and the quantization thresholds for one-bit data acquisition, but also to learn the latent parameters of iterative optimization algorithms specifically designed for the problem of one-bit sparse signal recovery. In addition, the proposed method has the ability to adaptively learn the proper quantization thresholds, paving the way for amplitude recovery in one-bit compressive sensing. Our results demonstrate a significant improvement compared to state-of-the-art model-based algorithms.
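A small example shows why thresholds matter for amplitude recovery. The sketch below takes one-bit measurements of the form sign(Ax - tau); the function name, the sensing matrix, and the threshold values are assumptions picked for illustration and are not learned as in the paper.

```python
def one_bit_measure(sensing_matrix, signal, thresholds):
    """Return +1/-1 measurements sign(A x - tau), one per row of A."""
    measurements = []
    for row, tau in zip(sensing_matrix, thresholds):
        projection = sum(a * v for a, v in zip(row, signal))
        measurements.append(1 if projection - tau >= 0 else -1)
    return measurements


A = [[1.0, 0.0], [0.0, 1.0]]
x = [0.5, -0.2]
x_scaled = [2 * v for v in x]

# With zero thresholds, scaling the signal leaves every sign unchanged,
# so the measurements carry no amplitude information.
print(one_bit_measure(A, x, [0.0, 0.0]), one_bit_measure(A, x_scaled, [0.0, 0.0]))

# With a nonzero threshold, the two signals produce different bits.
print(one_bit_measure(A, x, [0.75, 0.0]), one_bit_measure(A, x_scaled, [0.75, 0.0]))
```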

algorithm, iteration, quantization threshold, (14 more...)

arXiv.org Machine Learning

1911.1241

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Genre:

Research Report > New Finding (0.68)
Research Report > Promising Solution (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Fast Adjustable Threshold For Uniform Neural Network Quantization

Goncharenko, Alexander, Denisov, Andrey, Alyamkin, Sergey, Terentev, Evgeny

arXiv.org Machine Learning, Dec-19-2018

Neural network quantization is a necessary step for porting neural networks to mobile devices. Quantization accelerates inference and reduces memory consumption and model size. It can be performed without fine-tuning, using a calibration procedure (calculation of the parameters necessary for quantization), or the network can be trained with quantization from scratch. Training with quantization from scratch on labeled data is a rather long and resource-consuming procedure. Quantizing a network without fine-tuning leads to an accuracy drop because of outliers that appear during calibration. In this article we propose simplifying the quantization procedure significantly by introducing trained scale factors for quantization thresholds. This speeds up quantization with fine-tuning, which takes up to 8 epochs, and reduces the requirements on the set of training images. To our knowledge, the proposed method allowed us to obtain the first publicly available quantized version of MNAS without significant accuracy reduction: 74.8% vs 75.3% for the original full-precision network.
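One way to picture a trained scale factor for a quantization threshold is as a bounded multiplier on the calibrated maximum. The sketch below is a minimal illustration under that assumption; the function names, the clamping range for the factor, and the symmetric 8-bit integer range are hypothetical choices and may differ from the parameterization in the paper.

```python
import math


def adjusted_threshold(calibrated_max, alpha, alpha_min=0.5, alpha_max=1.0):
    """Scale a calibrated threshold by a (trainable) factor alpha.

    alpha is clamped to [alpha_min, alpha_max] so that training can only
    shrink the threshold within a bounded range.
    """
    alpha = max(alpha_min, min(alpha_max, alpha))
    return alpha * calibrated_max


def quantize_int8(x, threshold):
    """Map x to a signed 8-bit code in [-127, 127] for the given threshold."""
    scale = threshold / 127.0
    q = int(math.floor(x / scale + 0.5))
    return max(-127, min(127, q))


# Hypothetical calibrated maximum of 8.0, dominated by an outlier.
t = adjusted_threshold(8.0, alpha=0.75)
print(t, quantize_int8(100.0, t), quantize_int8(0.0, t))
```

Shrinking the threshold clips outliers harder but gives the remaining values a finer step size, so only the factor needs tuning rather than every weight.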

artificial intelligence, machine learning, neural network, (16 more...)

arXiv.org Machine Learning

1812.07872

Genre: Research Report (0.83)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
